October 21, 2016, Hopkins Marine Station, Stanford University

Exposure and confidence

  • My origins in programming, data science, and open science
  • Improving reproducibility, collaboration and communication in environmental science with open science tools
  • Resources and recommendations

Reproducibility

Reproducibility is foundational to science, but we rarely test it, even with our own work.

Fig of headlines (does this slide fit here)

Data science and open science

Data Science:

"an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge" (Grolemund & Wickham 2016)

Open Science:

"the concept of transparency at all stages of the research process, coupled with free and open access to data, code, and papers" (Hampton et al. 2014)

Data science and open science tools

My origin story

Some thesis questions

  • what are Humboldt squid habitat preferences?
  • what season are they most abundant?
  • how fast and far can they migrate?
  • how do they interact with other species?
  • how do I import my data when it's too big for Excel?
  • how do I subset years or other attributes?
  • how on earth do I visualize any of this?

Conflated questions

Science:

  • what are their habitat preferences?
  • what season are they most abundant?
  • how fast and far can they migrate?
  • how do they interact with other species?


Data science:

  • how do I import my data when it's too big for Excel?
  • how do I subset years or other attributes?
  • how on earth do I visualize any of this?
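The first two data science questions can be sketched in a few lines of code. This is a minimal illustration in Python rather than the R used later in this talk; the column names (`year`, `depth_m`, `count`) and the sample values are invented for the example and are not thesis data. Reading row by row means the file never has to fit in memory or in a spreadsheet.

```python
import csv
import io

# Hypothetical stand-in for a large CSV on disk; columns and values are made up.
SAMPLE = """year,depth_m,count
2008,200,3
2009,350,7
2009,400,2
2010,250,5
"""

def rows_for_years(lines, years):
    """Yield rows one at a time, keeping only those in the requested years."""
    reader = csv.DictReader(lines)
    for row in reader:
        if int(row["year"]) in years:
            yield row

# With a real file this would be: open("some_file.csv", newline="")
subset = list(rows_for_years(io.StringIO(SAMPLE), {2009, 2010}))
total = sum(int(row["count"]) for row in subset)
print(len(subset), total)
```

The same function works unchanged on an open file handle, since `csv.DictReader` accepts any iterable of lines.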

I learned to program like many do

  • in a panic
  • for a single purpose (get this thesis done!)
  • in near-isolation*


* Thankfully, I had wonderful programming mentors:

Steve Haddock, Dave Foley, Ashley Booth

NCEAS, UC Santa Barbara

TODO: image

Ocean Health Index

  • method to categorize benefits that oceans provide to people
  • scores are modeled using existing data
  • method can be tailored to different geographies
  • can help inform policy decisions, especially when repeated

OHI Global Assessments

2013: second annual global assessment

  • repeat methods
  • update data
  • compare between years

We expected to easily reproduce our previous work. We had planned ahead:

  • coded models
  • 130 pages of published supplemental material
  • internal documents and notes

We thought we were doing reproducible science

We struggled to reproduce our work using standard approaches to reproducibility and collaboration

  • added challenge of managing multiple years of information and scores
  • we needed a nimble approach to sharing data, methods, and results within and outside our team

Lowndes et al. in prep: Improving reproducibility, collaboration, and communication in environmental science using open science tools

underscore the importance of:

  • exposure and confidence
  • evolution rather than revolution

We identified three main challenges to overcome

  1. reproducibility, including transparency and repeatability, particularly in data preparation
  2. collaboration, including teamwork and internal collaboration
  3. communication with scientific and public communities

Addressing challenges using open science tools

Reproducibility - data preparation

"Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing data, before it can be explored for useful information." - NYTimes (2014)

  • transforming, rescaling, gap-filling, formatting, etc.
  • seldom mentioned but underpins the scientific process

Ultimate goal: Tidy data (Wickham 2014)
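As a rough sketch of what coded data preparation can look like, the example below reshapes a wide table into tidy form (one observation per row), gap-fills missing values, and rescales to 0-1. It is written in Python for illustration only. The regions, years, and values are invented, and the gap-filling rule (regional mean) and rescaling rule (divide by the maximum) are assumptions chosen for simplicity, not the methods used in any OHI assessment.

```python
# Hypothetical wide table: one row per region, one column per year.
wide = [
    {"region": "A", "2012": 40.0, "2013": None, "2014": 60.0},
    {"region": "B", "2012": 10.0, "2013": 20.0, "2014": None},
]

def to_tidy(rows):
    """Reshape wide rows into one observation per row: region, year, value."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key != "region":
                tidy.append({"region": row["region"], "year": int(key), "value": value})
    return tidy

def gap_fill(tidy):
    """Fill missing values with the mean of observed values in the same region.

    Illustrative rule only; it also flags which values were filled so the
    step stays transparent. Assumes every region has at least one observed value.
    """
    observed = {}
    for obs in tidy:
        if obs["value"] is not None:
            observed.setdefault(obs["region"], []).append(obs["value"])
    filled = []
    for obs in tidy:
        value = obs["value"]
        gapfilled = value is None
        if gapfilled:
            known = observed[obs["region"]]
            value = sum(known) / len(known)
        filled.append({**obs, "value": value, "gapfilled": gapfilled})
    return filled

def rescale(tidy):
    """Rescale values to the 0-1 range by dividing by the maximum value."""
    top = max(obs["value"] for obs in tidy)
    return [{**obs, "value": obs["value"] / top} for obs in tidy]

tidy = rescale(gap_fill(to_tidy(wide)))
for obs in tidy:
    print(obs)
```

Because each step is a named function, the whole preparation can be rerun when data are updated, and the `gapfilled` flag records where values were estimated rather than observed.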

Reproducibility - data preparation

Before

  • manually (without coding)
  • largely Microsoft Excel
  • internal documents and emails

After

  • full process coded
    • R with documentation
    • RMarkdown

Reproducibility - version control

TODO: version control quote

Before

  • filenames suffixed with dates, initials (e.g. final.csv and final_JL-2016-08-05.csv)
  • email descriptions of what changed between files

After

  • version control with git
  • short messages accompany committed changes
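One reason git can replace filename suffixes is that it identifies file contents by a hash of the contents themselves, not by the name. The sketch below reproduces how git computes the ID of a file's contents (a "blob"): SHA-1 over a short header plus the bytes. The CSV contents are made up for the example; the filenames in the comments are the ones from the "Before" list.

```python
import hashlib

def git_blob_id(data: bytes) -> str:
    """Return the object ID git assigns to file contents (a blob)."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()

# Hypothetical contents of final.csv and final_JL-2016-08-05.csv
original = b"region,score\nA,71\n"
renamed_copy = b"region,score\nA,71\n"   # same bytes under a different filename
edited_copy = b"region,score\nA,72\n"    # one value changed

same = git_blob_id(original) == git_blob_id(renamed_copy)
changed = git_blob_id(original) != git_blob_id(edited_copy)
print(same, changed)
```

Identical contents get the same ID no matter what the file is called, and any edit produces a new ID, so the history of what changed no longer depends on naming conventions or email descriptions.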

Collaboration

TODO: Collaboration quote

Before

  • team structure: scientists and single programmer
  • discrete tasks:
    • scientists developed the models conceptually, gathered data, prepared data, interpreted modeled results
    • programmer coded the models
  • communication + file sharing: email chains (often forwarded)
  • organization: individual conventions on local computers

After

  • team structure: data scientists
  • workflow: (simplified) GitHub workflow

Communication

Work in progress

  • incremental
  • always improving, learning
  • teaching and training, support

OHI Today

These tools and this workflow make our work possible.

  • December 8, 2016: releasing 5th global assessment
  • Support and training for ~26 government or academic 'OHI+' assessments

All on ohi-science.org

My recommendations

1 - Learn to code

    - in R
    - with RStudio

2 - Use version control

    - git
    - with GitHub
    - through RStudio

Introduce these concepts incrementally: evolution not revolution

Great resources

Secrets of becoming a programmer

TODO: quote from Woo et al.

Demos of main tools

  • R and RStudio for coding and visualization
  • Git for version control, GitHub for collaboration
  • GitHub + RStudio for organization, documentation, online publishing, distribution, and communication (ohi-science.org)

Extra slides

Improving using open science tools

reproducibility

  • data preparation: coding and documenting
  • modeling: R functions and packages
  • version control
  • organization

collaboration

  • teamwork
  • (simplified) GitHub workflow

communication

  • sharing data and code
  • sharing methods and instruction

My origins in programming and data science

  • as a grad student I learned due to panic
  • just needed to get the job done
  • PCfB (Practical Computing for Biologists, Haddock & Dunn)
  • resistant to version control